library(tidyverse)
library(patchwork)
# to install the superheat library run the following code:
# library(devtools)
# devtools::install_github("rlbarter/superheat")
library(superheat)[Chapter 6] Exploring the nutrition data
[DSLC stages]: EDA
We examined and cleaned the nutrition data in the file 01_cleaning.qmd. In this document, you will find the code for conducting an EDA of the nutrition data. This document is far from exhaustive and you are encouraged to add some sections to conduct your own explorations of this data.
In each code file that uses the cleaned version of the data, it is good practice to load in the original “raw” (uncleaned) data and then clean it using the cleaning function you wrote. It is often helpful to keep a copy of the original uncleaned data in your environment too.
First we will load in the libraries that we will use. Note that if you don’t have superheat, we recommend downloading it from github using the devtools code that is commented out, rather than from CRAN (the CRAN version is outdated).
Next, we will load the cleaning and pre-processing functions that we wrote.
# source the cleaning function code
source("functions/cleanFoodData.R")
source("functions/preProcessFoodData.R")And load in the raw data objects and clean them. We will create a cleaned version of both the FNDDS and Legacy datasets. We will also pre-process each dataset, by log-transforming and SD-scaling them (we won’t center them).
# load in all of the raw data objects
nutrient_amount <- read_csv("../data/food_nutrient.csv")Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 5370480 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): id, fdc_id, nutrient_id, amount, data_points, derivation_id, min, m...
lgl (2): median, footnote
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
food_name <- read_csv("../data/food.csv")Rows: 335142 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): data_type, description
dbl (2): fdc_id, food_category_id
date (1): publication_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nutrient_name <- read_csv("../data/nutrient_name.csv")Rows: 187 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): nutrient_name
dbl (1): nutrient_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean the FNDDS data
food_fndds <- cleanFoodData(.nutrient_amount_data = nutrient_amount,
.food_name_data = food_name,
.nutrient_name_data = nutrient_name,
# fndds is the default value
.select_data_type = "survey_fndds_food")
# Preprocess the FNDDS data
food_fndds_log_scaled <- preProcessFoodData(food_fndds,
.log_transform = TRUE,
.center = FALSE,
.scale = TRUE,
.remove_fat = FALSE)
# Clean the Legacy data
food_legacy <- cleanFoodData(.nutrient_amount_data = nutrient_amount,
.food_name_data = food_name,
.nutrient_name_data = nutrient_name,
.select_data_type = "sr_legacy_food")
# Preprocess the Legacy data
food_legacy_log_scaled <- preProcessFoodData(food_legacy,
.log_transform = TRUE,
.center = FALSE,
.scale = TRUE,
.remove_fat = FALSE)1 High-level summary of the data
Here are some histograms of the nutrients.
Ironically, the easiest way to plot these using ggplot is to first revert the data back to a long format.
food_fndds |>
# convert the data to a long format (ignoring the first description column)
pivot_longer(2:ncol(food_fndds),
names_to = "nutrient",
values_to = "amount") |>
# plot histograms
ggplot() +
geom_histogram(aes(x = amount)) +
facet_wrap(~nutrient, scales = "free", ncol = 3)`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distributions of each of these nutrients is fairly skewed.
After log-transforming and standardizing these nutrients however, some of them look substantially more symmetrical. Some of the nutrients that were particularly skewed are still fairly skewed.
food_fndds_log_scaled |>
pivot_longer(2:ncol(food_fndds),
names_to = "nutrient",
values_to = "amount") |>
ggplot() +
geom_histogram(aes(x = amount)) +
facet_wrap(~nutrient, scales = "free", ncol = 3)`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The following table shows a random sample of 15 food items.
set.seed(3243581)
food_fndds |>
sample_n(15) |>
arrange(description)| description | protein | fat | carbohydrates | calories | alcohol | moisture | caffeine | theobromine | total_dietary_fiber | calcium | iron | magnesium | phosphorus | potassium | sodium | zinc | copper | selenium | retinol | beta_carotene | alpha_carotene | alpha_tocopherol | cryptoxanthin | lycopene | lutein_zeaxanthin | vitamin_c | thiamine | riboflavin | niacin | vitamin_b6 | folate | vitamin_b12 | total_choline | phylloquinone | cholesterol | saturated_fat | butyric_acid | caproic_acid | caprylic_acid | capric_acid | lauric_acid | myristic_acid | palmitic_acid | stearic_acid | oleic_acid | linoleic_acid | alpha_linolenic_acid | arachidonic_acid | docosahexaenoic_acid | palmitoleic_acid | parinaric_acid | eicosenoic_gadoleic_acid | ecosapentenoic_acid | erucic_acid | docosapentaenoic_acid | monounsaturated_fat | polyunsaturated_fat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bread, marble rye and pumpernickel, toasted | 9.34 | 3.63 | 53.08 | 285 | 0 | 31.10 | 0 | 0 | 6.4 | 80 | 3.11 | 44 | 137 | 182 | 663 | 1.25 | 0.204 | 34.0 | 0 | 4 | 0 | 0.36 | 1 | 0 | 56 | 0.4 | 0.382 | 0.368 | 4.181 | 0.082 | 103 | 0.00 | 16.0 | 1.3 | 0 | 0.688 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.012 | 0.423 | 0.253 | 1.424 | 0.812 | 0.066 | 0.000 | 0.000 | 0.013 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 1.441 | 0.878 |
| Cheese sandwich, Cheddar cheese, on white bread, with mayonnaise | 12.97 | 22.82 | 26.75 | 366 | 0 | 34.85 | 0 | 0 | 1.4 | 333 | 1.94 | 22 | 218 | 95 | 567 | 1.72 | 0.065 | 22.0 | 121 | 32 | 0 | 0.77 | 0 | 0 | 24 | 0.0 | 0.287 | 0.283 | 2.494 | 0.070 | 68 | 0.41 | 17.7 | 20.6 | 41 | 8.605 | 0.231 | 0.194 | 0.125 | 0.299 | 0.345 | 1.099 | 4.294 | 1.716 | 4.961 | 5.886 | 0.786 | 0.026 | 0.001 | 0.202 | 0.000 | 0.057 | 0.004 | 0.003 | 0.006 | 5.690 | 6.737 |
| Hamburger, 1 medium patty, with condiments, on bun, from fast food / restaurant | 15.68 | 11.60 | 22.14 | 256 | 0 | 48.81 | 0 | 0 | 1.1 | 87 | 1.73 | 24 | 121 | 225 | 374 | 3.04 | 0.101 | 24.8 | 0 | 26 | 1 | 0.06 | 2 | 668 | 21 | 0.3 | 0.160 | 0.201 | 4.650 | 0.148 | 22 | 1.66 | 33.3 | 5.0 | 40 | 4.785 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.352 | 2.740 | 1.513 | 4.833 | 0.999 | 0.105 | 0.019 | 0.000 | 0.438 | 0.057 | 0.038 | 0.000 | 0.000 | 0.000 | 5.404 | 1.179 |
| Herring, smoked, kippered | 24.58 | 12.37 | 0.00 | 217 | 0 | 59.70 | 0 | 0 | 0.0 | 84 | 1.51 | 46 | 325 | 447 | 918 | 1.36 | 0.135 | 52.6 | 40 | 0 | 0 | 1.54 | 0 | 0 | 0 | 1.0 | 0.126 | 0.319 | 4.402 | 0.413 | 14 | 18.70 | 95.0 | 0.1 | 82 | 2.791 | 0.000 | 0.000 | 0.000 | 0.000 | 0.016 | 0.758 | 1.851 | 0.149 | 2.074 | 0.178 | 0.141 | 0.082 | 1.179 | 0.851 | 0.293 | 0.986 | 0.970 | 1.149 | 0.075 | 5.112 | 2.919 |
| Ice cream sandwich, made with light chocolate ice cream | 5.58 | 6.60 | 38.54 | 236 | 0 | 48.59 | 3 | 92 | 1.5 | 125 | 0.93 | 26 | 91 | 185 | 55 | 0.54 | 0.158 | 6.8 | 46 | 103 | 0 | 0.21 | 0 | 0 | 4 | 0.9 | 0.036 | 0.111 | 0.306 | 0.027 | 7 | 0.11 | 16.6 | 0.7 | 21 | 3.671 | 0.172 | 0.059 | 0.060 | 0.107 | 0.069 | 0.538 | 1.701 | 0.790 | 1.785 | 0.404 | 0.070 | 0.000 | 0.000 | 0.152 | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 1.938 | 0.477 |
| Oxtail soup | 8.40 | 4.87 | 1.37 | 85 | 0 | 84.68 | 0 | 0 | 0.3 | 11 | 1.12 | 9 | 68 | 105 | 139 | 2.75 | 0.054 | 7.3 | 11 | 237 | 85 | 0.14 | 0 | 106 | 59 | 1.9 | 0.029 | 0.082 | 0.804 | 0.093 | 6 | 0.65 | 32.8 | 10.2 | 31 | 2.212 | 0.053 | 0.033 | 0.020 | 0.044 | 0.045 | 0.225 | 1.156 | 0.621 | 1.719 | 0.152 | 0.014 | 0.011 | 0.000 | 0.137 | 0.000 | 0.004 | 0.001 | 0.000 | 0.004 | 1.861 | 0.179 |
| Pancakes, plain, reduced fat, from fozen | 7.56 | 9.54 | 36.09 | 262 | 0 | 43.98 | 0 | 0 | 2.2 | 208 | 1.51 | 20 | 329 | 171 | 667 | 0.75 | 0.064 | 11.3 | 71 | 12 | 0 | 1.12 | 1 | 0 | 101 | 0.0 | 0.221 | 0.292 | 1.423 | 0.151 | 46 | 0.35 | 45.4 | 6.8 | 61 | 2.811 | 0.087 | 0.058 | 0.040 | 0.071 | 0.075 | 0.232 | 1.566 | 0.615 | 3.232 | 2.424 | 0.291 | 0.027 | 0.008 | 0.070 | 0.000 | 0.026 | 0.000 | 0.000 | 0.001 | 3.337 | 2.761 |
| Pizza, cheese, with vegetables, from restaurant or fast food, medium crust | 10.31 | 8.69 | 30.71 | 242 | 0 | 48.08 | 0 | 0 | 2.2 | 170 | 2.26 | 23 | 196 | 171 | 535 | 1.22 | 0.101 | 17.9 | 55 | 96 | 1 | 0.77 | 0 | 1714 | 74 | 4.8 | 0.354 | 0.177 | 3.455 | 0.090 | 85 | 0.38 | 15.4 | 6.5 | 15 | 3.999 | 0.090 | 0.072 | 0.049 | 0.126 | 0.157 | 0.539 | 2.059 | 0.768 | 2.139 | 1.318 | 0.159 | 0.011 | 0.000 | 0.105 | 0.002 | 0.033 | 0.004 | 0.002 | 0.004 | 2.336 | 1.514 |
| Rice, brown, with vegetables and gravy, NS as to fat added in cooking | 2.24 | 2.53 | 18.21 | 105 | 0 | 76.10 | 0 | 0 | 1.7 | 9 | 0.49 | 27 | 73 | 89 | 213 | 0.51 | 0.080 | 3.5 | 0 | 371 | 173 | 0.34 | 0 | 0 | 114 | 0.6 | 0.120 | 0.073 | 1.685 | 0.089 | 9 | 0.00 | 10.4 | 6.0 | 1 | 0.486 | 0.000 | 0.000 | 0.001 | 0.009 | 0.002 | 0.009 | 0.362 | 0.085 | 0.967 | 0.783 | 0.082 | 0.001 | 0.000 | 0.032 | 0.000 | 0.010 | 0.000 | 0.000 | 0.000 | 1.011 | 0.867 |
| Spinach, creamed, baby food, strained | 2.50 | 1.30 | 5.70 | 37 | 0 | 89.60 | 0 | 0 | 1.8 | 89 | 0.62 | 55 | 54 | 191 | 49 | 0.31 | 0.060 | 2.4 | 23 | 2456 | 0 | 0.83 | 0 | 0 | 4505 | 8.7 | 0.015 | 0.104 | 0.216 | 0.075 | 61 | 0.06 | 16.1 | 196.7 | 5 | 0.702 | 0.035 | 0.019 | 0.013 | 0.025 | 0.029 | 0.112 | 0.310 | 0.130 | 0.291 | 0.060 | 0.071 | 0.000 | 0.000 | 0.028 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.332 | 0.133 |
| Tomato rice soup, prepared with water | 0.82 | 1.06 | 8.54 | 47 | 0 | 88.52 | 0 | 0 | 0.7 | 11 | 0.31 | 2 | 13 | 129 | 319 | 0.20 | 0.055 | 0.9 | 1 | 124 | 0 | 0.86 | 0 | 8591 | 1 | 5.8 | 0.024 | 0.020 | 0.411 | 0.030 | 6 | 0.00 | 7.8 | 1.5 | 1 | 0.200 | 0.000 | 0.005 | 0.000 | 0.000 | 0.005 | 0.015 | 0.110 | 0.050 | 0.230 | 0.435 | 0.090 | 0.000 | 0.000 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.235 | 0.525 |
| Tomato with corn and okra, cooked, made with butter | 1.51 | 2.48 | 6.72 | 49 | 0 | 87.97 | 0 | 0 | 2.0 | 36 | 0.39 | 18 | 29 | 154 | 323 | 0.26 | 0.054 | 0.6 | 16 | 156 | 2 | 0.40 | 6 | 1047 | 336 | 10.1 | 0.284 | 0.064 | 0.818 | 0.108 | 26 | 0.00 | 10.4 | 12.3 | 5 | 1.349 | 0.079 | 0.049 | 0.029 | 0.063 | 0.063 | 0.182 | 0.604 | 0.260 | 0.614 | 0.255 | 0.015 | 0.000 | 0.000 | 0.025 | 0.000 | 0.004 | 0.000 | 0.000 | 0.000 | 0.643 | 0.275 |
| Tomato, green, pickled | 1.09 | 0.24 | 8.24 | 36 | 0 | 89.54 | 0 | 0 | 1.1 | 15 | 0.48 | 10 | 26 | 180 | 125 | 0.10 | 0.072 | 0.6 | 0 | 347 | 54 | 0.40 | 34 | 0 | 5 | 24.4 | 0.050 | 0.037 | 0.426 | 0.087 | 9 | 0.00 | 6.7 | 7.6 | 0 | 0.030 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.022 | 0.006 | 0.046 | 0.070 | 0.008 | 0.000 | 0.000 | 0.001 | 0.000 | 0.004 | 0.000 | 0.009 | 0.000 | 0.061 | 0.078 |
| Turkey with gravy, dressing, potatoes, vegetable, frozen meal | 6.97 | 3.89 | 16.32 | 128 | 0 | 71.16 | 0 | 0 | 1.3 | 27 | 0.79 | 18 | 106 | 230 | 326 | 0.48 | 0.056 | 7.9 | 34 | 124 | 0 | 0.24 | 0 | 0 | 327 | 0.8 | 0.138 | 0.252 | 3.057 | 0.177 | 24 | 0.33 | 17.1 | 11.5 | 14 | 0.877 | 0.000 | 0.000 | 0.000 | 0.007 | 0.010 | 0.050 | 0.572 | 0.237 | 1.180 | 0.965 | 0.102 | 0.020 | 0.000 | 0.055 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.235 | 1.087 |
| Vegetable combination, excluding carrots, broccoli, and dark-green leafy; cooked, no sauce, made with margarine | 2.52 | 2.06 | 11.99 | 71 | 0 | 82.26 | 0 | 0 | 2.7 | 20 | 0.71 | 19 | 53 | 163 | 275 | 0.42 | 0.067 | 0.6 | 22 | 409 | 23 | 0.70 | 31 | 0 | 907 | 12.7 | 0.097 | 0.066 | 0.845 | 0.219 | 32 | 0.00 | 17.2 | 17.8 | 0 | 0.407 | 0.000 | 0.000 | 0.000 | 0.002 | 0.006 | 0.005 | 0.234 | 0.144 | 0.618 | 0.823 | 0.099 | 0.000 | 0.000 | 0.002 | 0.000 | 0.006 | 0.000 | 0.000 | 0.000 | 0.626 | 0.922 |
Next, let’s look at a heatmap of the entire dataset. There are clearly some columns that are similar, but arranged alphabetically, it is very hard to tease apart what they are.
# create a version of the data to visualize in the heatmap
food_fndds |>
dplyr::select(-description,-caffeine, -moisture, -alcohol, -theobromine) |>
mutate_all(~(. - mean(.)) / sd(.)) |>
# convert the dataset to a long-form
pivot_longer(everything(), names_to = "nutrient", values_to = "value") |>
# create a row ID variable
group_by(nutrient) |>
mutate(id = 1:n()) |>
ungroup() |>
# plot a heatmap
ggplot() +
geom_tile(aes(x = nutrient, y = id, fill = value)) +
# choose the fill gradient
scale_fill_gradientn("Scaled nutrient value",
colors = c("white", "grey80", "grey30", "black"),
values = c(0, 0.015, 0.02, 1)) +
# do some formatting stuff
scale_y_continuous(expand = c(0, 0)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
legend.position = "top") +
labs(x = NULL, y = NULL)Let’s create an explanatory version of this heatmap that arranges the nutrient columns so that similar nutrients are grouped together. We made these grouping decisions based on our own domain knowledge as well as an initial version of this heatmap. We will also arrange the food items/rows using a “clustering” algorithm that we will introduce in Chapter 8.
Figure 2 shows the result.
# create a version of the data to visualize in the heatmap
food_heatmap <- food_fndds |>
dplyr::select(-description,-caffeine, -moisture, -alcohol, -theobromine) |>
mutate_all(~(. - mean(.)) / sd(.))
# define a vector for manually arranging the nutrient columns into nutrient groups
nutrient_order <- c(# vitamins
"vitamin_b6", "alpha_tocopherol", "riboflavin", "thiamine", "folate", "niacin", "vitamin_c", "vitamin_b12", "retinol", "beta_carotene", "lutein_zeaxanthin", "phylloquinone", "lycopene", "cryptoxanthin", "alpha_carotene",
#major minerals
"sodium", "potassium", "calcium", "phosphorus", "magnesium", "total_choline",
# trace minerals
"iron", "zinc", "selenium", "copper",
# carbohydrates
"carbohydrates", "total_dietary_fiber",
# fats
"fat", "saturated_fat", "monounsaturated_fat", "polyunsaturated_fat", "palmitic_acid", "stearic_acid", "oleic_acid", "linoleic_acid", "arachidonic_acid", "palmitoleic_acid", "alpha_linolenic_acid", "eicosenoic_gadoleic_acid", "cholesterol", "butyric_acid", "caproic_acid", "caprylic_acid", "capric_acid", "myristic_acid", "lauric_acid", "docosapentaenoic_acid", "docosahexaenoic_acid", "parinaric_acid", "ecosapentenoic_acid", "erucic_acid",
# calories
"calories",
# protein
"protein")
# do a fancy clustering to decide the order of the rows/food items
# you'll learn this in chapter 8
food_order <- hclust(dist(food_heatmap))$order
food_heatmap[food_order, nutrient_order] |>
# convert the dataset to a long-form
pivot_longer(everything(), names_to = "nutrient", values_to = "value") |>
# ensure that the nutrient columns appear in the correct order by
# reordering the factor levels
mutate(nutrient = fct_inorder(nutrient)) |>
# create a row ID variable
group_by(nutrient) |>
mutate(id = 1:n()) |>
ungroup() |>
# plot a heatmap
ggplot() +
geom_tile(aes(x = nutrient, y = id, fill = value)) +
# add some boundary lines
geom_vline(xintercept = c(0.5, 15.5, 21.5, 25.5, 27.5, 51.5, 52.5, 53.5),
color = "grey20", linewidth = 1.2) +
geom_hline(yintercept = c(0, nrow(food_heatmap)),
color = "grey20", linewidth = 1.2) +
# choose the fill gradient
scale_fill_gradientn("Scaled nutrient value",
colors = c("white", "grey80", "grey30", "black"),
values = c(0, 0.015, 0.02, 1)) +
# do some formatting stuff
scale_y_continuous(expand = c(0, 0)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
legend.position = "top") +
labs(x = NULL, y = NULL)1.1 PCS evaluation of heatmap exploration
Predictability
Let’s see if these same patterns exist in the Legacy food dataset. Below, we recreate the heatmap using the Legacy dataset. Note that we are coloring missing values blue.
# create a version of the data to visualize in the heatmap
food_legacy_heatmap <- food_legacy |>
select(one_of(colnames(food_fndds))) |>
dplyr::select(-description,-caffeine, -moisture, -alcohol, -theobromine) |>
mutate_all(~(. - mean(., na.rm = TRUE)) / sd(., na.rm = TRUE))
# do a fancy clustering to decide the order of the rows/food items
# you'll learn this in chapter 8
# impute missing values with mean
food_legacy_heatmap_imputed <- food_legacy_heatmap |> mutate(across(where(is.numeric),
~if_else(is.na(.), mean(., na.rm = TRUE), .)))
food_legacy_order <- hclust(dist(food_legacy_heatmap_imputed))$order
food_legacy_heatmap[food_legacy_order, nutrient_order] |>
# convert the dataset to a long-form
pivot_longer(everything(), names_to = "nutrient", values_to = "value") |>
# ensure that the nutrient columns appear in the correct order by
# reordering the factor levels
mutate(nutrient = fct_inorder(nutrient)) |>
# create a row ID variable
group_by(nutrient) |>
mutate(id = 1:n()) |>
ungroup() |>
# plot a heatmap
ggplot() +
geom_tile(aes(x = nutrient, y = id, fill = value)) +
# add some boundary lines
geom_vline(xintercept = c(0.5, 15.5, 21.5, 25.5, 27.5, 51.5, 52.5, 53.5),
color = "grey20", linewidth = 1.2) +
geom_hline(yintercept = c(0, nrow(food_legacy_heatmap)),
color = "grey20", linewidth = 1.2) +
# choose the fill gradient
scale_fill_gradientn("Scaled nutrient value",
colors = c("white", "grey80", "grey30", "black"),
values = c(0, 0.015, 0.02, 1), na.value = "lightsteelblue1") +
# do some formatting stuff
scale_y_continuous(expand = c(0, 0)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
legend.position = "top") +
labs(x = NULL, y = NULL)Since the food items between the FNDDS and Legacy datasets are different, we don’t expect to see the exact same patterns, but we do see that columns that had similar patterns in the FNDDS dataset also seem to have similar patterns in the Legacy dataset.
Stability to transformation judgment call
Let’s investigate the stability of this heatmap result to the judgment call of log-transforming the data (the original version was scaled and centered, but not log-transformed).
When we use the log-transformed scaled and centered data, the trends are fairly similar. Note, however, that for both plots we had to do a lot of finagling with the scale_fill_gradientn() values argument that controls the positions of the color transition.
# create a version of the data to visualize in the heatmap
food_heatmap_log <- food_fndds_log_scaled |>
dplyr::select(-description,-caffeine, -moisture, -alcohol, -theobromine) |>
mutate_all(~. - mean(.))
food_heatmap_log[food_order, nutrient_order] |>
# convert the dataset to a long-form
pivot_longer(everything(), names_to = "nutrient", values_to = "value") |>
# ensure that the nutrient columns appear in the correct order by
# reordering the factor levels
mutate(nutrient = fct_inorder(nutrient)) |>
# create a row ID variable
group_by(nutrient) |>
mutate(id = 1:n()) |>
ungroup() |>
# plot a heatmap
ggplot() +
geom_tile(aes(x = nutrient, y = id, fill = value)) +
# add some boundary lines
geom_vline(xintercept = c(0.5, 15.5, 21.5, 25.5, 27.5, 51.5, 52.5, 53.5),
color = "grey20", linewidth = 1.2) +
geom_hline(yintercept = c(0, nrow(food_heatmap)),
color = "grey20", linewidth = 1.2) +
# choose the fill gradient
scale_fill_gradientn("Scaled nutrient value",
colors = c("white", "grey80", "grey30", "black"),
values = c(0, 0.15, 0.17, 1)) +
# do some formatting stuff
scale_y_continuous(expand = c(0, 0)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
legend.position = "top") +
labs(x = NULL, y = NULL)2 Correlation exploration
Prompted by the heatmaps above, we next ask which nutrient variables are correlated with one another.
The heatmap below in Figure 5 (created using superheat() this time) displays the correlation matrix of the nutrients. The pretty.order.rows = TRUE and pretty.order.cols = TRUE arguments tell superheat() to use a clustering algorithm (see Chapter 8) under-the-hood to order the rows/columns so that similar rows/columns will appear next to one another.
It is even clearer from this figure that there are some groups of highly correlated nutrients.
superheat(cor(select(food_fndds, -description, -caffeine,
-moisture, -alcohol, -theobromine)),
heat.pal = c("#F6AE2D", "white", "#18678B"),
heat.pal.values = c(0, 0.16, 1),
bottom.label.text.angle = 90,
bottom.label.size = 0.2,
bottom.label.text.size = 4,
bottom.label.text.alignment = "right",
left.label.size = 0.2,
left.label.text.size = 4,
left.label.text.alignment = "right",
pretty.order.rows = TRUE,
pretty.order.cols = TRUE)The question is, how similar are these groups to the nutrient groups that we identified in the previous section?
2.1 PCS evaluations of correlation matrix
Predictability
Let’s see if the same relationships hold in the Legacy food dataset. This time, however, rather than keeping rows with missing values, we remove them for now since correlations can’t be computed for data with missing values. Note that this may introduce bias in the data, but for now we won’t worry too much about it (but we will be aware of it).
The groupings seem to be very clear in this Legacy dataset too.
food_legacy |>
select(one_of(colnames(food_fndds)), -description, -caffeine,
-moisture, -alcohol, -theobromine) |>
drop_na() |>
cor() |>
superheat(heat.pal = c("#F6AE2D", "white", "#18678B"),
heat.pal.values = c(0, 0.2, 1),
bottom.label.text.angle = 90,
bottom.label.size = 0.2,
bottom.label.text.size = 4,
bottom.label.text.alignment = "right",
left.label.size = 0.2,
left.label.text.size = 4,
left.label.text.alignment = "right",
pretty.order.rows = TRUE,
pretty.order.cols = TRUE)Stability to visualization judgment calls
As a quick stability evaluation, we will arrange the columns and rows of the correlation matrix in the same order as the heatmaps in Figure 1.
superheat(cor(select(food_fndds, -description,
-caffeine, -moisture,
-alcohol, -theobromine))[nutrient_order, nutrient_order],
heat.pal = c("#F6AE2D", "white", "#18678B"),
heat.pal.values = c(0, 0.16, 1),
bottom.label.text.angle = 90,
bottom.label.size = 0.2,
bottom.label.text.size = 4,
bottom.label.text.alignment = "right",
left.label.size = 0.2,
left.label.text.size = 4,
left.label.text.alignment = "right")There are still some very clear nutrient groupings in the heatmap when the nutrients are ordered based on our manual ordering defined above.
Stability to pre-processing judgment call
The correlations are generally larger after log-transforming the nutrient data. The groupings are still there but are a little bit more merged together, with the exception of the vitamins that still seem fairly distinct.
superheat(cor(select(food_fndds_log_scaled, -description,
-caffeine, -moisture,
-alcohol, -theobromine)),
heat.pal = c("#F6AE2D", "white", "#18678B"),
heat.pal.values = c(0, 0.28, 1),
bottom.label.text.angle = 90,
bottom.label.size = 0.2,
bottom.label.text.size = 4,
bottom.label.text.alignment = "right",
left.label.size = 0.2,
left.label.text.size = 4,
left.label.text.alignment = "right",
pretty.order.rows = TRUE,
pretty.order.cols = TRUE)3 Defining nutrient groups
Based on these plots and our own nutrient domain knowledge, we will define the following groups of nutrients:
Vitamins: vitamin C, vitamin B6, vitamin B12, riboflavin, thiamine, folate, niacin, beta carotene, alpha carotene, lutein zeaxanthin, phylloquinone, alpha tocopherol, retinol, lycopene, cryptoxanthin
Fats: fat, saturated fat, monounsaturated fat, polyunsaturated fat, cholesterol, and all of the fatty acids
Major minerals: sodium, potassium, calcium, phosphorus, magnesium, total choline
Trace minerals: iron, zinc, selenium, copper
Carbohydrates: carbohydrates, total dietary fiber
Calories: calories
Protein: protein
We chose to ignore the variables “moisture”, “alcohol”, “caffeine”, and “theobromine”, since we deemed these not particularly important for making nutritional dietary choices.
Note that these groupings are very clearly a judgment call, and we could certainly have defined them differently!
4 Exploring the relationship between the food items
The above explorations gave us a vague sense of the relationships between the nutrients (we will try to formalize this in the PCA analysis file), but we don’t yet have a sense of the food items themselves.
One way to explore the food items might be to take two nutrient variables (such as sodium and potassium) and visualize the food items in the space defined by these food items.
The points in the sodium-potassium scatterplot in Figure 9 clearly exhibit a large amount of overlap (despite adding transparency with the alpha argument), but adding the text of a few food items helps tease apart some relationships in the food items. For instance, it seems that many cheese-related items (high in sodium, moderate in potassium) and fish-related food items (high in both sodium and potassium) are grouped close together.
We are visualizing the log-transformed standardized data because the points are more spread out in this transformed space and so the trends are slightly easier to pick out.
food_fndds_log_scaled |>
ggplot() +
geom_vline(xintercept = 0, col = "grey70") +
geom_hline(yintercept = 0, col = "grey70") +
geom_point(aes(x = sodium, y = potassium),
alpha = 0.2, col = "grey50") +
geom_text(aes(x = sodium, y = potassium,
label = str_trunc(description, 15)),
check_overlap = TRUE, hjust = 0) Let’s look at a few other pairs of variables too. To make it easy for ourselves, we can define the following reusable function. Note that the {{.}} notation in the function allows us to pass unquoted variable names as arguments in the tidyverse style (this is called “tidy evaluation”).
plotNutrientScatter <- function(var1, var2) {
food_fndds_log_scaled |>
ggplot() +
geom_vline(xintercept = 0, col = "grey70") +
geom_hline(yintercept = 0, col = "grey70") +
geom_point(aes(x = {{ var1 }}, y = {{ var2 }}),
alpha = 0.2, col = "grey50") +
geom_text(aes(x = {{ var1 }}, y = {{ var2 }},
label = str_trunc(description, 15)),
check_overlap = TRUE, hjust = 0)
}When we compare fat and vitamin C values, we see groups of fruits/vegetables (high in vitamin C, low in fat), groups of cereals (high in vitamin C and medium in fat), and groups of milk products (low in vitamin C and low in fat - probably these are mostly “fat free” milk products) standing out.
plotNutrientScatter(fat, vitamin_c)When we compare alpha carotene and protein, we see meats stand out as high in both, and milk products stand out as notably low in alpha carotene (but varied in protein).
plotNutrientScatter(alpha_carotene, protein)5 Comparing the FNDDS and Legacy data
If we’re going to use the SR legacy food dataset for conducting predictability assessments, we should explicitly compare this dataset to the FNDDS dataset.
A simple check is to compare the distributions of various nutrients. To compare the distributions of protein, fat, and sodium, we use side-by-side boxplots. Overall, the distributions look fairly similar, but are clearly not identical.
These differences could be a difference in the way the nutrients are measured, or it could be a difference in the types of food items that are included in each of the datasets.
# generate a vector of common nutrients
common_vars <- colnames(food_fndds)[colnames(food_fndds) %in% colnames(food_legacy)]
food_common <- rbind(data.frame(dataset = "fndds", food_fndds[, common_vars]),
data.frame(dataset = "legacy", food_legacy[, common_vars]))
gg_box_iron <- food_common |>
ggplot() +
geom_boxplot(aes(x = dataset, y = iron)) +
ggtitle("Iron")
gg_box_protein <- food_common |>
ggplot() +
geom_boxplot(aes(x = dataset, y = protein)) +
ggtitle("Protein")
gg_box_fat <- food_common |>
ggplot() +
geom_boxplot(aes(x = dataset, y = fat)) +
ggtitle("Fat")
gg_box_sodium <- food_common |>
ggplot() +
geom_boxplot(aes(x = dataset, y = sodium)) +
ggtitle("Sodium")
(gg_box_protein + gg_box_fat) / (gg_box_iron + gg_box_sodium)Warning: Removed 80 rows containing non-finite values (`stat_boxplot()`).
Warning: Removed 84 rows containing non-finite values (`stat_boxplot()`).
Let’s also compare the nutrient values of a few specific food items. For instance, here are the nutrient values of various nonfat Greek yogurt.
food_common |>
filter(str_detect(tolower(description), "yogurt"),
str_detect(tolower(description), "greek"),
str_detect(tolower(description), "fat"),
str_detect(tolower(description), "plain"),
str_detect(tolower(description), "low")) |>
transmute(dataset, description = str_trunc(description, 30),
protein, fat, carbohydrates, calories, calcium, sodium) |>
arrange(dataset)| dataset | description | protein | fat | carbohydrates | calories | calcium | sodium |
|---|---|---|---|---|---|---|---|
| fndds | Yogurt, Greek, low fat milk… | 9.95 | 1.92 | 3.94 | 73 | 115 | 34 |
| legacy | Yogurt, Greek, plain, lowfat | 9.95 | 1.92 | 3.94 | 73 | 115 | 34 |
food_common |>
filter(str_detect(tolower(description), "soup"),
str_detect(tolower(description), "minestrone"),
str_detect(tolower(description), "can")) |>
transmute(dataset, description = str_trunc(description, 30),
protein, fat, carbohydrates, calories, calcium, sodium) |>
arrange(dataset)| dataset | description | protein | fat | carbohydrates | calories | calcium | sodium |
|---|---|---|---|---|---|---|---|
| fndds | Minestrone soup, reduced so… | 2.00 | 0.80 | 9.00 | 50 | 20 | 215 |
| fndds | Minestrone soup, canned, pr… | 1.74 | 1.03 | 4.60 | 34 | 16 | 261 |
| legacy | Soup, minestrone, canned, c… | 3.48 | 2.05 | 9.17 | 68 | 28 | 516 |
| legacy | Soup, minestrone, canned, c… | 2.13 | 1.17 | 8.64 | 53 | 25 | 288 |
| legacy | Soup, minestrone, canned, p… | 1.77 | 1.04 | 4.66 | 34 | 14 | 254 |
| legacy | Soup, minestrone, canned, r… | 2.00 | 0.80 | 9.00 | 50 | 20 | 215 |
food_common |>
filter(str_detect(tolower(description), "pecan"),
str_detect(tolower(description), "pie")) |>
transmute(dataset, description = str_trunc(description, 30),
protein, fat, carbohydrates, calories, calcium, sodium) |>
arrange(dataset)| dataset | description | protein | fat | carbohydrates | calories | calcium | sodium |
|---|---|---|---|---|---|---|---|
| fndds | Pie, pecan | 4.5 | 16.69 | 59.61 | 407 | 22 | 275 |
| fndds | Pie, pecan, individual size… | 4.5 | 16.69 | 59.61 | 407 | 22 | 275 |
| legacy | Pie, pecan, prepared from r… | 4.9 | 22.20 | 52.20 | 412 | 32 | 262 |
| legacy | Pie, pecan, commercially pr… | 4.5 | 16.69 | 59.61 | 407 | 22 | 275 |
Overall, since we would expect some differences in the food items that people might use our app on and the underlying data in any of these datasets, these differences across the different datasets feel in line with the differences we would expect from the future data that our users would be inputting into our app.
We’re going to stop here, but feel free to conduct any additional explorations that occur to you!